DOCUMENT RESUME

ED 395 028                                                  TM 025 046

AUTHOR        Braun, Henry I.; Wainer, Howard
TITLE         Making Essay Test Scores Fairer with Statistics. ETS
              Program Statistics Research Technical Report No.
              89-90.
INSTITUTION   Educational Testing Service, Princeton, N.J.
REPORT NO     ETS-RR-89-43
PUB DATE      Oct 89
NOTE          14p.
PUB TYPE      Reports - Evaluative/Feasibility (142)

EDRS PRICE    MF01/PC01 Plus Postage.
DESCRIPTORS   *Essay Tests; Estimation (Mathematics); *Interrater
              Reliability; *Scoring; *Statistical Analysis; Trend
              Analysis; *True Scores
IDENTIFIERS   *Advanced Placement Examinations (CEEB);
              *Calibration; Fairness; Variability



ABSTRACT 

A desirable goal would be to develop a methodology 
for scoring essays so that the final grades are less affected by when 
or by whom each essay was read. It seems sensible to derive such 
grades by somehow adjusting the ratings originally given by each 
reader. This essay describes a solution that relies on statistical 
adjustment, using the context of the College Board's Advanced 
Placement program. Nonstatistical provisions, such as rater training, 
are in place to minimize the potential impact of rater differences on 
grades, but there is no simple way of getting a true score for an 
essay. The basic idea in using statistical thinking to help is to 
reduce the effect on scoring reliability of some of the sources of 
variability through calibrating readers and days on which essays are 
read. Estimating the relative stringency of raters and the scoring 
trends across time is made possible by the choice of experimental 
design developed by statisticians. An example illustrates the 
approach. Calibration experiments on five different Advanced 
Placement examinations showed that, in general, calibrated scores 
enhance reliability, but there are obstacles to overcome before the 
approach can be operationalized with actual essays. (Contains three 
tables and three references.) (SLD) 


INTRODUCTION 

As the graded tests were handed back, a crescendo of groans echoed through 
the classroom. After the initial shock was registered, the long-suffering teacher 
smiled benignly and stated, “Your poor performance, relative to previous classes, 
indicates that this form of the test was more difficult than I had anticipated. 
I’ll have to curve the scores.” The students’ relief was palpable. 

This sort of scene is common. “Curving the scores” is the transformation of 
the usual rules of correspondence between percent correct and its associated 
letter grade. In classroom tests the effect of curving almost always allows a 
score to qualify for a higher grade than would ordinarily be expected. While 
almost everyone knows this, the question of why teachers grade on a curve 
is shrouded in mystery. The answer, in its simplest terms, is that we curve (adjust) 
test scores to allow fairer comparisons among individuals who take different 
forms of the test. 

A similar problem of adjusting test scores for fairness occurs in the subjective 
scoring of essays. When a large collection of essays is to be graded, it is 
common to engage a number of individuals to carry out the scoring, with a 
different sample of essays assigned to each reader. The difficulty of an essay 
question involves both the inherent difficulty of the question (for example: 
“Describe your activities over the Christmas holidays” versus “Compare Kant’s 
metaphysics with Aristotle’s”) and the strictness of the reader who scores it. 
We can control differences of the first kind by asking everyone the same 
questions, but practical considerations prevent us from using the same control 
for the readers. Yet, if one reader has more stringent criteria than the others, 
those examinees who were unfortunate enough to have their exams assigned to 
this reader (analogous to being assigned a more difficult test form) are at a 
disadvantage. Fairness requires that these differences be removed (by 
transforming, or curving, the ratings of the readers so that they are 
comparable). Readers’ criteria may also shift through time; they might be more 
lenient on Monday than on Friday. If such variability exists, fairness requires 
that these day-to-day differences also be removed. 

A desirable goal, then, is to develop a methodology for scoring essays so 
that the final grades are less affected by when or by whom each essay was read. 
It seems sensible to derive such grades by somehow adjusting the ratings 
originally given by each reader. The rest of this essay describes one solution 
that relies on statistical adjustment. The solution is described in the context 
of a testing program that includes an important essay component, the College 
Board’s Advanced Placement (AP) Program. 



THE ADVANCED PLACEMENT PROGRAM 

The AP Program offers specialized curricula in a wide variety of subjects, 
including English, American history, European history, mathematics, biology, 
chemistry, French, and German. High school students who participate in the 
program and who do well in the final examination are eligible to receive 
college credit for their work. In each subject, the same final examination is 
given all across the United States on a particular Saturday in May. Each 
examination has a section of multiple-choice questions and a section of 
free-response questions. In mathematics or chemistry, free-response questions 
require the student to work out solutions to problems, while in English or 
American history they require the student to write essays. 

The answers to the free-response questions must be scored by human readers, 
since computer programs are not yet intelligent enough to read students’ 
handwriting and to assign values to the material. Because tens of thousands of 
students may write an essay on a given topic, the grading process involves 
bringing together as many as a hundred readers to grade papers continuously for 
four or five days. The readers include both high school teachers of AP courses 
and college teachers of those subjects. Each essay (or problem) is read by only 
one reader, chosen at random from the pool of available readers. He or she 
assigns a grade that becomes part of the total score. 



PROBLEMS WITH SCORING ESSAYS 



Whether a student is unfairly advantaged (or disadvantaged) by having his or 
her essay read by one particular reader rather than another is a critical 
issue. Readers, being human, will differ in their judgments of the 
quality of a particular essay and so the score assigned to that essay will depend 
to some extent on the “luck of the draw.” This dependence on chance is 
undesirable and should be eliminated to the extent feasible. 

Before we can act to eliminate this variability we have to understand how 
it can arise. First, different readers may have different scales for scoring. That 
is, two readers may agree on how to rank a set of papers but one might system- 
atically assign higher grades than the other. Second, two readers may assign 
the same scores on average but generally disagree on which essays deserve high 
grades and which low. In practice, both kinds of discrepancies, as well as others, 
will occur to some extent. 

Because the grading process extends over a number of days, the score assigned 
to an essay may also depend on when it is graded. There may be, for example, 
a general trend to grade more leniently (or more stringently) over the course 
of the week. Beyond this general trend, individual readers will exhibit their 
own trends through time. Such global patterns in assigned scores have nothing 
to do with the quality of the essays. If these patterns exist, they also contribute 
to the role that chance plays in the grade assigned to a student’s essay. 

Nonstatistical provisions are currently in place to minimize the potential im- 
pact of these factors on the grades. The AP Program carefully trains readers 
before the scoring sessions begin and continuously monitors them during the 
sessions. For each subject, a chief reader with several years’ experience in the 
program is appointed to take responsibility for the integrity of the scoring pro- 
cess. Soon after the answer booklets are returned, the chief reader selects a 
number of essays to illustrate different levels of the score scale. After extensive 
discussions with the senior readers and, eventually, with all the readers in the 
pool, the chief reader constructs a detailed list of criteria. Adherence to this 
“rubric” is monitored by periodically asking all the readers to grade the same 
paper. If substantial discrepancies occur, the readers undergo further training. 
This approach seems to work reasonably well but, as we shall see, there re- 
mains room for improvement. 

Before we go on to discuss how statistical thinking can help, we must have 
some way of measuring how well a suggested approach succeeds in reducing 
the role of chance in grade assignment. This will provide us with a yardstick 
by which to judge the effectiveness of a new method. 



HOW WELL ARE WE DOING? 

Unfortunately there is no simple way of getting a “true score” for an essay, 
so we cannot simply compare the assigned score with “truth” and use the 
difference as an indication of the influence of chance. If an essay were read by 
all the readers in the pool, then the average of these scores could be used in 
place of a true score. It would be impractical, however, to obtain so many 
readings except for a very few essays. 

Scientists who study test scores have followed a rather different strategy. They 
judge the merit of a scoring procedure by applying it twice to a large sample 
of essays and assessing the agreement between the two sets of results. In the 
case of AP, they might select a sample of 500 essays for the “experiment.” Each 
essay would then be scored twice — each time on a day and by a reader chosen 
at random. Using the first set of scores, the essays would be listed from high 
to low. A second ranking would be obtained from the second set of scores. If 
the role of chance is relatively small, then an essay should fall at about the same 
place in the two lists. But if chance makes a large contribution, then the two 
rankings will differ considerably. 

The level of agreement between rankings is usually measured by a quantity 
called the reliability coefficient. The reliability coefficient is a number that 
is calculated from the numerical information contained in the two lists. In this 
setting, it can range from near 0 to near 1. If there is little agreement between 
the two lists, the coefficient will take on a value near 0, indicating that chance 
is playing a substantial role in the grading process. On the other hand, if there 
is substantial agreement between the lists, the coefficient will take on a value 
near 1, indicating that chance is playing a minor role. 
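
The essay does not give a formula for the reliability coefficient. As a minimal sketch, assuming the coefficient behaves like a rank correlation between the two lists, the Python below computes Spearman's rank correlation (the Pearson correlation of the ranks) for two sets of scores on the same essays. The function names and the sample scores are invented for illustration and are not taken from the AP data.

```python
from statistics import mean


def ranks(scores):
    """Return 1-based ranks, giving tied scores the average of their ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied scores that starts at position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def rank_agreement(first_scores, second_scores):
    """Spearman rank correlation between two scorings of the same essays."""
    return pearson(ranks(first_scores), ranks(second_scores))


# Hypothetical scores for five essays, each read twice.
first = [500, 620, 410, 700, 550]
second = [520, 600, 450, 680, 610]
print(rank_agreement(first, second))
```

In this made-up example only two essays swap places between the lists, so the coefficient comes out close to 0.9; two lists in opposite order would give -1, and unrelated lists a value near 0.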

Typical values of the reliability coefficient for essay scores in the humanities 
are between .3 and .6. For problems in chemistry, the reliability coefficient 
usually lies between .6 and .8. To get some feeling for what these numbers mean, 
consider the following findings. If a group of boys are ranked by height at age 
six and then again at age ten the reliability coefficient for the two lists is greater 
than .8. If the boys are ranked by performance on an objectively scored in- 
telligence test at two different ages, the reliability coefficient is usually greater 
than .6. Finally, if boys of the same age are ranked once by height and again 
by performance on an intelligence test, the reliability coefficient for the two 
relatively unrelated lists is usually only about .2 or .3. 

It is not unusual (such as in the study described below) to have more than 
just two rankings of a set of essays. In situations like this we can reduce the 
many rankings to just two by simply choosing any two at random and calculating 
the reliability as before. Later, when we talk of the reliability of a particular 
scoring procedure, we will be referring to a measure that is closely akin to a 
pairwise reliability averaged over all pairs of judges. 




CALIBRATING READERS AND DAYS 

We are now ready to see how statistical thinking can help. The basic idea is 
to reduce the effect on scoring reliability of some of the sources of variability 
we have mentioned: systematic differences between readers or between days. 
By that we mean the following: if we knew, for the same set of papers, that 
one reader would assign scores that were on average 10 points higher than 
another reader’s, we could adjust the first set of scores by subtracting 10 points 
from each of them (or by adding 10 points to each of the scores in the second 
set). The two sets of scores would thus have the same average. This is as it should 
be since they refer to the same set of papers. 

Exactly the same sort of adjustments could be used to deal with systematic 
differences between days. If a set of papers graded on one day received scores 
that were on average, say, 5 points lower than they would receive on another 
day, we could add 5 points to the first set of scores to make the averages equal. 
The process of making averages equal is called calibration. In the context of 
essay scoring, calibrating both readers and days would improve the reliability 
of the scores by eliminating two sources of chance variation. The degree of 
improvement would depend on, among other things, how large these differ- 
ences were in the first place. 
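
The adjustment described above can be sketched in a few lines of Python. The sketch assumes the reader and day offsets are already known and simply add to a paper's score; the reader names, day names, and offsets are hypothetical and mirror the 10-point and 5-point examples in the text.

```python
def calibrate(score, reader, day, reader_offset, day_offset):
    """Remove the known average offset of the reader and of the day."""
    return score - reader_offset[reader] - day_offset[day]


# Hypothetical offsets relative to a common baseline:
# reader R1 scores 10 points higher than R2, and Monday runs 5 points low.
reader_offset = {"R1": 10, "R2": 0}
day_offset = {"Mon": -5, "Tue": 0}

# The same paper, scored once by R1 on Monday and once by R2 on Tuesday.
raw_r1_monday = 505
raw_r2_tuesday = 500

print(calibrate(raw_r1_monday, "R1", "Mon", reader_offset, day_offset))
print(calibrate(raw_r2_tuesday, "R2", "Tue", reader_offset, day_offset))
```

Both calls return the same calibrated score, which is the point of the procedure: after adjustment the score no longer depends on who read the paper or when.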



COLLECTING DATA 

Where are we to obtain the information that we need to carry out the calibra- 
tion? In the operational grading, each paper is only read once — by a particular 
reader on a particular day. If readers assign different average grades over the 
course of the five-day grading period, we do not know whether to attribute 
those differences to real and consistent differences among the readers, or to 
differences in the quality of the essays they happened to read, or both. To make 
some progress, we will have to collect specialized data that will give us the in- 
formation we need. 

Statistical theory can guide us to the design of an experiment that will effi- 
ciently collect those data and tell us how to use them appropriately. Consider 
the following experiment. Suppose we choose a small sample of essays at ran- 
dom from among the pool of tens of thousands available and arrange to have 
each essay read by each reader on each day. The data thus obtained would allow 
us to estimate average differences among readers as well as average differences 
among days. (We use the term estimate because we would have observed the 
grading behavior of the readers only for the sample and not for all the essays.) 
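
As a rough illustration of how such estimates could be computed from a complete experiment, the sketch below generates made-up readings in which every reader scores every essay on every day, then estimates each reader's and each day's effect as its average score minus the overall average. The additive model, the essay qualities, and the effect sizes are assumptions chosen for the example; the essay does not state the estimation formulas actually used.

```python
from statistics import mean

# Made-up additive model, used only to generate example readings.
essay_quality = {"e1": 400, "e2": 500, "e3": 600}
true_reader_effect = {"A": 20, "B": 0, "C": -20}
true_day_effect = {1: 5, 2: -5}

# Complete design: every reader scores every essay on every day.
# Each row is (essay, reader, day, score).
readings = [
    (essay, reader, day, quality + r_eff + d_eff)
    for essay, quality in essay_quality.items()
    for reader, r_eff in true_reader_effect.items()
    for day, d_eff in true_day_effect.items()
]


def estimate_effects(rows, position):
    """Average score at each level of one factor, minus the overall average."""
    grand_mean = mean(row[3] for row in rows)
    levels = {row[position] for row in rows}
    return {
        level: mean(row[3] for row in rows if row[position] == level) - grand_mean
        for level in levels
    }


reader_effects = estimate_effects(readings, position=1)
day_effects = estimate_effects(readings, position=2)
print(reader_effects)
print(day_effects)
```

Because every reader sees the same essays on the same days, differences in essay quality cancel out of the reader averages, and the estimates recover the effects that were built into the example.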

We could use these estimates, obtained from this small sample of essays, to 
calibrate readers and days. That means we could adjust the scores for the en- 
tire pool of essays, by whomever and whenever they were graded, based on 
the information collected in the experiment. But before we do that we have 
to consider carefully the quality of the information we would be using. 

This experiment presents at least two problems. Because of the enormous 
number of readings that have to be carried out, there is a severe restriction on 
the number of extra readings that can be added for the experiment. Since each 
reader is to read each essay on each day, the number of essays has to be kept 
very small — say, five to ten. This raises questions about the representativeness 
of the results: Would we get substantially different estimates if we chose another 
set of five essays? A second issue arises from the repeated readings of the essays. 
To the extent that readers remember the score they assigned to an essay on the 
previous day and just copy it, we are not collecting bona fide information. Such 
distortions in the estimates could result in our making adjustments in the wrong 
direction, so that calibrations would lower reliability rather than raise it! 



STATISTICS TO THE RESCUE 

Our aim is to estimate the relative stringency of the different readers as well 
as the scoring trends across time without encountering the pitfalls mentioned 
above. Statisticians have devoted a lot of effort to solving problems of this 
sort. They have developed special methods for efficiently collecting data, 
called experimental designs. An example of a design that meets our needs is 
contained in Table 1. The table represents a set of instructions for allocating 
readings for a four-day experiment involving 12 readers and 32 essays chosen 
at random from the pool. (One of the reasons that the numbers 12 and 32 were 
chosen is that they are both divisible by 4, the length of this particular 
experiment; other combinations are possible.) Each of the 32 rows corresponds to 
a different essay, while each of the 12 columns corresponds to a different reader. 
The numbers in each row of the table indicate which readers are assigned to 
read that essay on that day. For example, reader 1 grades essay 16 on day 2. 

Table 1 Allocation plan of essays to readers 

* The entries in the table indicate the day that reader scored that essay. 
† Rows 17-32 are just duplicates of rows 1-16. 

This design calls for each of the 32 essays to be scored three times each day, 
for 96 readings a day altogether. (Note that if each reader were required to 
score every essay each day within an overall limit of 96 readings a day, only 
8 essays could be included in the experiment. By relaxing this requirement, we 
are able to employ four times as many essays.) The allocation of readers to 
essays is not done in a haphazard way. In fact, there is a delicate choice of 
reader-essay combinations that enables us to obtain estimates of systematic 
differences among readers, even though no two readers read exactly the same 
set of papers on any one day. 

Over the course of this four-day experiment, each reader will read each of 
the 32 essays exactly once. Consequently, there are no repeat reading, or carry- 
over, effects to worry about. Since each essay is read three times each day and 
each reader reads eight essays each day, we can also obtain estimates of 
systematic differences between days. Because our estimates are based on a sample 
of 32 essays, rather than the eight essays that would be the limit with a 
complete design involving the 96 readings, they should be more representative 
as well. With the particular design we have chosen, it is even possible to make 
useful comparisons between readers on a day-by-day basis. 
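
The counting properties stated above can be checked mechanically. The sketch below builds one simple allocation plan with the same counts (12 readers, 32 essays, 4 days) and verifies them. The rule `assigned_day` is a hypothetical cyclic assignment chosen for illustration; it is not the authors' Table 1 and makes no claim to share its finer balance properties.

```python
from collections import Counter

N_ESSAYS, N_READERS, N_DAYS = 32, 12, 4


def assigned_day(essay, reader):
    """Day (1-4) on which `reader` scores `essay` in this illustrative plan."""
    return (essay + reader) % N_DAYS + 1


# One entry per (essay, reader) pair, so each reader reads each essay once.
plan = {
    (essay, reader): assigned_day(essay, reader)
    for essay in range(1, N_ESSAYS + 1)
    for reader in range(1, N_READERS + 1)
}

readings_per_day = Counter(plan.values())
per_essay_day = Counter((essay, day) for (essay, _), day in plan.items())
per_reader_day = Counter((reader, day) for (_, reader), day in plan.items())

print(sorted(readings_per_day.items()))  # readings on each of the four days
print(set(per_essay_day.values()))       # times each essay is read per day
print(set(per_reader_day.values()))      # essays each reader reads per day
```

Under this plan every day has 96 readings, every essay is read three times a day, and every reader reads eight essays a day, matching the counts given in the text.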



SOME RESULTS 

To get a flavor of the results, we present the findings of one such experiment 
carried out for an essay question in English Literature and Composition for 
which scores were on a scale of 100 (low) to 900 (high). Table 2 shows for each 
day the average scores assigned to essays graded on that day as well as the dif- 
ferences between these day averages and the overall average for the entire ex- 
periment. On day 1, for example, the average score was 490, which is 7 points 
higher than the overall average of 483. 

Ideally the day averages should be very similar and indeed they are in this 
case. (The largest difference among days is 12 points. This is less than 3% of 
the average score in the experiment.) But this means that there is very little 
to be gained in trying to adjust for systematic differences among days — there 
just aren’t any! 



Table 2 Daily averages and their deviations from the mean 

                                      Day Average Minus 
Day                   Day Average     Experiment Average 

1                         490                  7 
2                         479                 -4 
3                         478                 -5 
4                         485                  2 

Experiment Average        483 

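
The figures in Table 2 can be reproduced with a short calculation. The sketch below takes the four day averages from the table and recomputes the experiment average, the deviations, and the largest gap between days; the variable names are illustrative. Averaging the day averages directly is legitimate here because each day contributes the same number of readings.

```python
# Day averages as reported in Table 2.
day_average = {1: 490, 2: 479, 3: 478, 4: 485}

experiment_average = sum(day_average.values()) / len(day_average)
deviation = {day: avg - experiment_average for day, avg in day_average.items()}
largest_gap = max(day_average.values()) - min(day_average.values())
gap_percent = round(100 * largest_gap / experiment_average, 1)

print(experiment_average)        # 483.0
print(deviation)                 # {1: 7.0, 2: -4.0, 3: -5.0, 4: 2.0}
print(largest_gap, gap_percent)  # 12 2.5
```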

On the other hand, Table 3 presents the average score assigned by each reader 
over the course of the experiment. To see the substantial differences more 
clearly, we also show the differences between these reader averages and the 
overall average of 483. Reader A, the most lenient reader, typically scored essays 
82 points higher than the average while reader L, the most stringent reader, 
typically scored essays 58 points lower than the average. Remember these are 
just estimates of the differences in scoring levels between readers based on 32 
readings. Nonetheless, they have considerable credibility because through our 
design we have been able to balance out sources of variation that could other- 
wise degrade the estimates. 

It certainly appears as if, for this question at least, days don’t matter much 
but readers do. We have to remember, though, that there are three times as many 
scores contributing to a day average as there are contributing to a reader average. 
Accordingly, some proportion of the greater variability we observe among reader 
averages (as compared to day averages) may be due to the vagaries of chance. 
However, we can capitalize on the features of this particular design and the 
methods of statistical hypothesis testing to properly compare the relative varia- 
tion of readers versus days. When we do, we find that our first, naive impres- 
sions are justified: readers matter much more than days. 

To carry out the calibration, then, we subtract 82 from all the essays graded 
by A, subtract 61 points from all the essays graded by B, and so on. We can 
judge the effectiveness of the procedure by comparing the reliability of the 
original scores with that of the adjusted scores. The former is .57 and the lat- 
ter is .61, a difference of .04. That doesn’t sound like a great improvement for 
all that effort. The following calculations may help put the gain in some 
perspective. 

Table 3 Reader averages and their deviations from the mean 

                                        Reader Average Minus 
Reader                Reader Average    Experiment Average 

A                         565                  82 
B                         544                  61 
C                         517                  34 
D                         506                  23 
E                         487                   4 
F                         484                   1 
G                         476                  -7 
H                         473                 -10 
I                         454                 -29 
J                         432                 -51 
K                         432                 -51 
L                         425                 -58 

Experiment Average        483 

To ease comprehension of this table, readers have been ordered by the average 
grade that they assigned. 

By using some mathematical analysis it is possible to show that if each essay 
had been read independently by two readers and the average of the two scores 
used as the final score, then the reliability of these averaged scores would be 
about .73. (Obtaining multiple readings is the standard way of improving 
reliability.) Our gain of .04 is 25% of .16 = .73 - .57, the gain in reliability 
possible with double reading. 
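
The essay does not name the mathematical analysis behind the .73 figure. One standard result that reproduces it is the Spearman-Brown formula for the reliability of an average of several parallel readings; the sketch below assumes that formula applies here, and the function name is illustrative.

```python
def spearman_brown(single_reliability, n_readings):
    """Projected reliability of the average of n parallel readings."""
    r = single_reliability
    return n_readings * r / (1 + (n_readings - 1) * r)


original, calibrated = 0.57, 0.61

double_read = spearman_brown(original, 2)
# Share of the double-reading gain achieved by calibration, using the
# projected reliability rounded to two decimals as in the text.
share = (calibrated - original) / (round(double_read, 2) - original)

print(round(double_read, 2))
print(round(share, 2))
```

With a single-reading reliability of .57 the projection for two readings rounds to .73, and the calibration gain of .04 works out to a quarter of the .16 available from double reading.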

Remember that with the information gleaned from this little experiment we 
can adjust the scores of the entire collection of essays submitted. Our data have 
been obtained at a small fraction of the cost of hiring enough extra readers to 
double read the tens of thousands of essays on hand. We estimate the cost factor 
will typically be about one-thirtieth. Since we have achieved one-quarter of 
the gain at one-thirtieth the cost, a cost/benefit analysis would yield a factor 
of seven or eight in favor of the calibration approach. This means that if it cost, 
say, $5,000 to run the experiment, it would have cost about $150,000 to hire 
enough readers for a complete double reading, and so one-quarter of that 
amount ($37,000) would be required to achieve the same gain in reliability. This 
suggests that using calibration should be seriously considered. 
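
The cost/benefit arithmetic can be laid out explicitly. The sketch below uses the illustrative figures from the text ($5,000 for the experiment, a cost factor of thirty, one-quarter of the gain) and assumes, as the text does, that the cost of double reading scales in proportion to the gain it buys. The variable names are hypothetical.

```python
experiment_cost = 5_000  # illustrative cost of the calibration experiment
cost_ratio = 30          # double reading costs about 30 times as much
gain_share = 0.25        # calibration achieves one-quarter of the gain

double_reading_cost = experiment_cost * cost_ratio
# Cost of buying the same gain through double reading, assuming the gain
# scales in proportion to the money spent.
equivalent_cost = double_reading_cost * gain_share
advantage = equivalent_cost / experiment_cost

print(double_reading_cost)
print(equivalent_cost)
print(advantage)
```

One-quarter of $150,000 is $37,500, which the essay quotes in round figures, and the resulting advantage of 7.5 is the "factor of seven or eight" cited in favor of calibration.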



SHOULD CALIBRATION BE USED? 

Calibration experiments have now been carried out on five different AP ex- 
aminations. In general, calibrated scores exhibit enhanced reliability — especially 
when the reliability of the original scores is on the low side to begin with. In 
one case the estimated reliability of the calibrated scores actually exceeded the 
projected reliability of double reading! The obvious success of such an experi- 
ment, however, is not sufficient to guarantee the operational implementation 
of the procedure. There are many other issues to be addressed. 

One such issue arises because the experiment we have described requires 
considerable planning and analysis. We have also investigated another calibra- 
tion procedure that can better be adapted to the tight time schedules that must 
be met in reporting scores to candidates. It is one we previously mentioned; 
namely, to make the adjustments on the basis of the operational grades. For 
example, if the average grade assigned by a reader over the entire grading period 
was 10 points higher than the average grade for all readers, we would then sub- 
tract 10 points from all the scores that grader assigned. A potential difficulty 
with this approach is that this reader, by chance, may have been assigned essays 
that were typically better than average and deserved the higher scores. In that 
case, an adjustment of 10 points would be too large. 

In practice, the essay booklets undergo various stages of haphazard shuffling 
before landing on a reader’s table. Unfortunately we have no direct way of deter- 
mining whether readers typically receive representative (truly random) samples 
of essays. But this is precisely where our experiment plays an important role. 
We can compare the calibration using the operational scores and the calibra- 
tion using the results of the experiment (in which the sample of essays is con- 
trolled and the randomization is carefully executed). When we do, we find the 
results are very much the same. This gives us confidence in the simpler, cheaper 
method. 

It is worth pointing out that the data collected in an experiment such as the 
one we have described can lead to insights that are just not available from the 
operational data. Using methods of analysis that are too technical to be described 
here, we can learn more about the relative contributions of readers and days 
to score unreliability, and do it in a way that facilitates comparisons across 
different tests. We can also estimate the upper limit of reliability that can be 
achieved through calibration. This gives us a meaningful target to shoot for. 

In addition to considerations of feasibility, we also have to take into account 
the possible reactions of both students and schools to the notion of statistical 
adjustment of scores. Since the first phase of this research has clearly established 
that a statistically designed experiment can make the process of grading essays 
more fair, it only remains to iron out these other aspects before adopting its 
use widely. As this essay is being written (December 1987) this decision 
process is under way, and calibration may be operationally in place as you read 
about it. 



PROBLEMS 

1. Does training of essay raters yield the result that all readers will score the 
same essay identically? Why or why not? 

2. Would you expect essay readers to change their scoring scale over the course 
of the week? 

3. Why is calibration of essay readers necessary? 

4. Why can’t we just have all essays read by several readers? 

5. What is the advantage of using the complex experimental design in Table 1 
rather than just having all experimental essays read by each of the readers 
on each of the days? 

6. How much accuracy is gained by adjusting for differences in reader 
performance? 